Abstract

Recent years have seen an explosion of domain-specific accelerators for Convolutional Neural Networks (CNNs). Most prior CNN accelerators target networks for image recognition, such as AlexNet, VGG, GoogleNet, and ResNet. In this paper, we take a different route and study the acceleration of 3D CNNs, which are more computation-intensive than 2D CNNs and exhibit more reuse opportunities. Based on our characterization of representative 3D CNNs, we leverage differential convolution across the temporal dimension, which operates on the temporal delta of imaps for each layer and processes the computation bit-serially using only the effectual bits of the temporal delta. To further exploit both spatial and temporal locality, and to make the architecture general to all CNNs, we propose a control mechanism that dynamically switches between a spatial delta dataflow and a temporal delta dataflow. We call our design the temporal-spatial value aware accelerator (TSVA). Evaluation on a set of representative neural networks shows that TSVA achieves an average 4.24x speedup and 1.42x energy efficiency. While we target 3D CNNs for video recognition, TSVA could also benefit other general CNNs for continuous batch processing.

1. Introduction 

The end of Moore’s law [1] and Dennard scaling [2], and the consequent dark silicon phenomenon [3], have ended the rapid improvement of general-purpose program performance. Instead of improving general-purpose computing, domain-specific architecture is now considered the most effective way forward. In recent years, neural networks (NNs) have attracted great interest as they achieve state-of-the-art performance on many tasks [4], including image recognition [5], speech recognition [6], video understanding [7] and analytics [8][9], and autonomous driving [10][11]. To obtain better performance and energy efficiency through domain-specific acceleration, various hardware accelerators have been designed [12][13][14][15][16][17][18][19][20].

Given the recent rapid development of neural networks for image recognition, deep learning-based approaches have also significantly improved video recognition performance. Action recognition from videos requires capturing context from the entire video (i.e., a sequence of frames) rather than just capturing information from each frame [21]. Video recognition has also received considerable attention from the computer vision community [22][7][23][24][25][26], with different datasets developed for different domains [27][28][29]. Deep 3-dimensional convolutional neural networks (3D CNNs) have demonstrated outstanding classification performance in video recognition.

Figure 1. A real 3D CNN Model for video action recognition.

A video-based 3D CNN infers the activity from a sequence of frames extracted directly from the video. This involves identifying different actions across video clips (i.e., sequences of frames), where the action may or may not be performed throughout the entire duration of the video [21]. The task is hard for the following reasons: (1) High computational cost. For instance, a simple 2D convolution network for image classification over 101 classes has just ~5M parameters, whereas the same architecture inflated to a 3D structure has ~33M parameters [21]. It also takes 3 to 4 days to train a 3D convolutional neural network on the UCF101 dataset [27] and about two months on Sports-1M [7]. (2) Long-range context. Capturing an action involves capturing spatiotemporal context across frames [21]. Both local and global context (motion information) need to be captured for robust predictions.

The algorithmic complexity and memory bandwidth demand of 3D CNNs put great pressure on processing speed. Different platforms have been used to accelerate 3D CNNs, including FPGAs [30][31] and ASICs [32]. 3D CNNs exhibit characteristics that make directly applying 2D acceleration techniques infeasible: (1) the working set size exceeds on-chip memory; (2) 3D CNNs are far more computation-intensive than memory-intensive according to the Roofline model [33]; (3) 3D CNNs exhibit more reuse opportunities, since they offer data reuse along the temporal dimension in addition to spatial reuse.

Video processing and analytics are popular in people’s daily lives, and video traffic is predicted to account for most internet traffic in the future [34]. Accelerating 3D CNNs for video recognition will therefore shed light on the development of future video analytics, saving power and making users’ lives easier.

Figure 2. Illustration of 3D CONV. The illustration shows the case of stride=1, which is the most common case.

Thus, this paper explores acceleration architectures for 3D CNNs. 2D CNN accelerators mainly focus on designing dataflows that effectively exploit data reuse across the spatial dimensions and across filters. 3D CNNs add one more dimension, the temporal dimension, which offers data reuse across consecutive frames. Since the changes between consecutive frames sent to a 3D CNN are not arbitrary, there are large data overlaps, which exist not only in the first layer as consecutive input frames, but also in the input activation maps (imaps) of the intermediate layers of the 3D CNN. We therefore first conduct a characterization and show that: (1) there is large redundancy between imaps across the temporal dimension; (2) quantization of input activations further increases the data overlap.

To exploit this opportunity, we leverage differential convolution across the temporal dimension, which operates on the temporal delta of imaps for each layer and processes the computation using only the effectual bits of the temporal delta. However, the temporal delta dataflow alone does not benefit every layer of a 3D CNN, and it benefits only 3D CNNs. We therefore propose a control mechanism to switch between a spatial delta dataflow and a temporal delta dataflow. We call the resulting design the temporal-spatial value aware accelerator (TSVA). It not only yields higher speedup, but also makes the acceleration architecture general to all CNNs. In summary, we make the following contributions:

First, we study an important class of CNNs: 3D CNNs for video recognition. Our characterization shows that large data overlap exists between imaps across the temporal dimension, and that quantization of imaps increases the overlap. This implies architectural optimization opportunities.

Second, we leverage the data overlap across the temporal dimension to efficiently process only the effectual terms in 3D CNN computation. To further exploit both spatial and temporal locality and make our architecture general to all CNNs, we propose a control mechanism that dynamically switches between the spatial delta dataflow and the temporal delta dataflow.

Third, we evaluate our design on multiple 3D CNNs and some state-of-the-art 2D CNNs, comparing it with other NN accelerators [12][13][35], and show that our accelerator achieves an average 4.24x speedup and 1.42x energy efficiency.


for m = 0 to M-1
  for d = 0 to Dout-1        (temporal dimension: number of frames)
    for h = 0 to Hout-1      (row)
      for w = 0 to Wout-1    (column)
        for c = 0 to C-1
          for t = 0 to T-1
            for r = 0 to R-1
              for s = 0 to S-1 do
                Filter = F[m][c][t][r][s]
                Input  = In[c][d*stride+t][h*stride+r][w*stride+s]
                Output[m][d][h][w] += Filter * Input

Listing 1. 3D Convolution in 3D CNN.
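
The loop nest of Listing 1 can be written as a small runnable sketch. The version below is plain Python on nested lists, assumes no padding, and uses illustrative names (`conv3d`, `inp`, `filt`) that are not from the paper; it is meant to make the index arithmetic concrete, not to be efficient.

```python
def conv3d(inp, filt, stride=1):
    """Direct 3D convolution following the loop nest of Listing 1 (no padding).

    inp  : nested list indexed [c][d][h][w]
    filt : nested list indexed [m][c][t][r][s]
    Returns a nested list indexed [m][d][h][w].
    """
    C, D, H, W = len(inp), len(inp[0]), len(inp[0][0]), len(inp[0][0][0])
    M = len(filt)
    T, R, S = len(filt[0][0]), len(filt[0][0][0]), len(filt[0][0][0][0])
    d_out = (D - T) // stride + 1
    h_out = (H - R) // stride + 1
    w_out = (W - S) // stride + 1
    out = [[[[0] * w_out for _ in range(h_out)] for _ in range(d_out)]
           for _ in range(M)]
    for m in range(M):
        for d in range(d_out):
            for h in range(h_out):
                for w in range(w_out):
                    acc = 0
                    for c in range(C):
                        for t in range(T):
                            for r in range(R):
                                for s in range(S):
                                    acc += (filt[m][c][t][r][s]
                                            * inp[c][d * stride + t]
                                                 [h * stride + r]
                                                 [w * stride + s])
                    out[m][d][h][w] = acc
    return out


# Toy input: 1 channel, 3 frames of 2x2 pixels, all ones.
inp = [[[[1, 1], [1, 1]] for _ in range(3)]]
# Toy filter: 1 output map, 1 channel, 2x2x2 kernel of ones.
filt = [[[[[1, 1], [1, 1]] for _ in range(2)]]]
out = conv3d(inp, filt)
print(out)  # [[[[8]], [[8]]]]
```

Each of the two output frames sums the eight ones covered by the 2x2x2 kernel, which is why both entries equal 8.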

The rest of this paper is organized as follows: Section 2 describes the necessary background for 3D CNNs and characterizes the data overlap. Section 3 introduces the architecture design. Section 4 discusses the benefits of our architecture for general 2D CNNs. Section 5 evaluates our design. Section 6 concludes the paper.

2. Background and Motivation 

2.1 3D Convolution Operation 

The computation of 3D CONV is more complicated than that of 2D CONV, and can be described as follows:

$$O[m][d][h][w] = \sum_{c=0}^{C-1} \sum_{t=0}^{T-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} F[m][c][t][r][s] \times In[c][d \cdot stride + t][h \cdot stride + r][w \cdot stride + s]$$

where F and In represent the M×C×T×R×S filter and the C×D×H×W input activation map. In each layer, a set of C input feature maps of size D×H×W is convolved with M sets of C×Ksize filters (Ksize = R×S×T), producing M output feature maps of size D_out×H_out×W_out, where D_out = floor((D-T)/stride) + 1, H_out = floor((H-R)/stride) + 1, and W_out = floor((W-S)/stride) + 1. D is the temporal dimension, which indicates the number of consecutive frames sent to the neural network for video recognition. For example, D is 16 for the first layer’s input in the C3D network. 2D convolution is the special case of 3D convolution in which D equals 1 and T equals 1. The computational complexity of 3D convolution is much higher than that of 2D convolution: I×M×C×H×W×R×S for 2D convolution versus I×M×C×D×H×W×R×S×T for 3D convolution. The pseudocode of 3D CONV is shown in Listing 1 and illustrated in Figure 2 with stride equal to 1, which is the most common case (D_out = D-T+1, H_out = H-R+1, W_out = W-S+1).
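
As a rough illustration of the output-size formulas and of how the temporal dimension inflates the work, the sketch below computes the output shape and the multiply-accumulate (MAC) count of one layer. The layer dimensions are hypothetical values chosen for the example (they are not taken from the paper), padding is ignored, and the MAC count is taken over output positions rather than the input-size approximation used in the text.

```python
def conv3d_shape_and_macs(M, C, D, H, W, T, R, S, stride=1):
    """Output shape (D_out, H_out, W_out) and MAC count of one 3D CONV layer."""
    d_out = (D - T) // stride + 1
    h_out = (H - R) // stride + 1
    w_out = (W - S) // stride + 1
    # One MAC per (output element, input channel, kernel element).
    macs = M * C * d_out * h_out * w_out * T * R * S
    return (d_out, h_out, w_out), macs


# Hypothetical layer: 3 input channels, 64 filters, 16 frames of 112x112.
shape_3d, macs_3d = conv3d_shape_and_macs(M=64, C=3, D=16, H=112, W=112,
                                          T=3, R=3, S=3)
# The 2D special case: D = 1 and T = 1.
shape_2d, macs_2d = conv3d_shape_and_macs(M=64, C=3, D=1, H=112, W=112,
                                          T=1, R=3, S=3)
print(shape_3d, shape_2d, macs_3d // macs_2d)
```

For these assumed dimensions the 3D layer performs D_out × T = 14 × 3 = 42 times as many MACs as its 2D counterpart.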

The computation pattern of the FC layers in a 3D CNN is matrix-vector multiplication, the same as for the FC layers in a 2D CNN. Two properties of the CONV layers stand out: (1) CONV layers in 3D CNNs are more computation-intensive and account for a larger share of execution time than in 2D CNNs. For example, the CONV layers of C3D (including ReLU) account for over 99.6% of the computation time [31]. (2) The amount of intermediate data (i.e., psums) is much larger than the weights (356.7 MB versus 66.5 MB for C3D), even though C3D is still a small-scale network compared to other 3D CNNs [36][37].

Figure 3. Sparsity of input activations (a, b), temporal delta (c, d), and spatial delta (e, f). Panels: (a) C3D raw activations, (b) I3D raw activations, (c) C3D temporal delta, (d) I3D temporal delta, (e) C3D spatial delta, (f) I3D spatial delta. “Avg” is the sparsity across all layers; “inc” stands for inception module [36]. The sparsity across all layers is denoted in the figures.


A typical CNN accelerator consists of three levels of hierarchy: (1) external memory, (2) on-chip buffers, and (3) processing engines (PEs) and register files. The basic flow is to fetch data and weights from external memory into the on-chip buffers and feed them into the registers and PEs. After the PE computation completes, results are transferred back to the on-chip buffers and, if necessary, to external memory, to be used as the next layer’s inputs.

2.2 Activation Sparsity in 3D CNN 

Sparsity in a CNN layer is defined as the fraction of zeros in the layer’s weight and input activation matrices. Weight sparsity mainly comes from the pruning process (unstructured [38][39] or structured [40]) during training. There is already plenty of work on weight pruning algorithms [39][41][42] and hardware accelerators [16][43][12]. Activation sparsity arises dynamically during inference and is highly dependent on the data being processed. To measure activation sparsity, we execute two typical 3D CNNs, C3D [23] and I3D [37], and instrument the code to inspect the activation sparsity of the CONV layers. Quantizing [44][45][46][35] the input activations to a specific range can reduce computation overhead while keeping the decrease in inference accuracy minimal. The baseline stores activations in 16 bits, while weights are always stored in 16 bits in this paper, based on the well-recognized 16-bit multiplier [13]. Prior work has shown that linear quantization [39] to 8 bits [47][48][49][50] does not cause accuracy loss, and that further quantizing the input activations to fewer bits (5 or more) causes less than 1% accuracy loss. Our experiments confirm this: we run C3D with a pre-trained model (without retraining) and evaluate the accuracy on the UCF101 [27] dataset. The accuracies are 88.3% for 8 bits, 87.7% for 5 bits (0.7% accuracy loss), and 84.5% when further quantizing to 4 bits (4.3% accuracy loss), which shows that quantizing the imaps to fewer than 5 bits can cause intolerable accuracy loss. I3D exhibits similar trends. Thus, we choose 16, 8, and 5 bits to analyze the input activation characteristics.
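
The interaction between quantization and sparsity can be sketched in a few lines. The quantizer below is a simple uniform (linear) quantizer with truncation for non-negative activations; the truncation rule, the `max_value` scale, and the sample activations are assumptions made for illustration and are not the exact scheme used in our measurements.

```python
def quantize(acts, bits, max_value):
    """Uniformly quantize non-negative activations to `bits` bits (truncating)."""
    levels = (1 << bits) - 1
    return [min(levels, int(a * levels / max_value)) for a in acts]


def sparsity(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


# Hypothetical post-ReLU activations in [0, 1].
acts = [0.0, 0.02, 0.3, 0.0, 0.9, 0.05, 0.6, 0.01]
s8 = sparsity(quantize(acts, 8, 1.0))
s5 = sparsity(quantize(acts, 5, 1.0))
print(s8, s5)  # 0.25 0.5
```

With fewer bits the quantization step grows, so small activations such as 0.02 and 0.01 collapse to zero and the measured sparsity rises.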

Figures 3(a) and 3(b) show the activation sparsity (y-axis) across the CONV layers. (1) 3D CNNs have activation sparsity characteristics similar to those of 2D CNNs [51][17]. For 16 bits, the activation sparsity is 47.2% for C3D and 28.9% for I3D. (2) Activation sparsity varies across layers; it is typically low in the earlier layers and increases gradually in later layers. (3) With quantization, the activation sparsity increases from 47.2% (16 bits) to 56% (8 bits) and then 67.3% (5 bits) for C3D, and from 28.9% (16 bits) to 30.2% (8 bits) and then 34.5% (5 bits) for I3D.

Prior works have leveraged activation sparsity to design hardware accelerators [19][51], using either prediction-based [19] or speculation-based [51] methods. However, none of them combines it with temporal locality: the high degree of input activation redundancy across the temporal dimension. 3D CNNs add the temporal dimension D (D is 16 consecutive frames for C3D and 64 for I3D), which provides acceleration opportunities. Next, we characterize the temporal locality and the corresponding effectual bits.

Figure 4. Cumulative distribution of the number of effectual terms per activation/delta over C3D and I3D. Panels include (e) C3D average effectual bits comparison and (f) I3D average effectual bits.

2.3 Input Activation Temporal Locality 

A video is a sequence of consecutive frames. There are large overlaps between these consecutive frames, which constitute the temporal locality. The input activations In[D][C][H][W] are represented as In[d][C][H][W], In[d+1][C][H][W], In[d+2][C][H][W], ..., with d = 0, 1, 2, ..., D-1. We define the temporal delta T_delta as:

T_delta = In[d+1][C][H][W] - In[d][C][H][W]

If an element of T_delta is zero, the input activations at the same position of In[d+1][C][H][W] and In[d][C][H][W] are the same. Thus, the sparsity of T_delta represents the overlap/redundancy between consecutive input activations (In[d+1][C][H][W] and In[d][C][H][W]). We characterize the sparsity of T_delta across all layers, as shown in Figures 3(c) and 3(d).
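
The definition of T_delta and its sparsity can be illustrated with a short sketch. Each frame is flattened into a list of quantized activations at the same (c, h, w) positions; the function names and the sample values are illustrative assumptions, not data from our characterization.

```python
def temporal_delta(frames):
    """T_delta between consecutive frames: frames[d+1] - frames[d], elementwise."""
    return [[b - a for a, b in zip(frames[d], frames[d + 1])]
            for d in range(len(frames) - 1)]


def sparsity(rows):
    """Fraction of zero entries over a list of equally shaped rows."""
    flat = [v for row in rows for v in row]
    return sum(1 for v in flat if v == 0) / len(flat)


# Three hypothetical consecutive frames, four positions each.
frames = [[3, 0, 5, 7],
          [3, 0, 6, 7],
          [3, 1, 6, 7]]
deltas = temporal_delta(frames)
print(deltas)                              # [[0, 0, 1, 0], [0, 1, 0, 0]]
print(sparsity(frames), sparsity(deltas))  # raw vs. temporal-delta sparsity
```

In this toy case only 2 of 12 raw activations are zero, while 6 of 8 delta values are zero, which mirrors the trend reported in Figure 3.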

(1) First, temporal deltas tend to be sparser than raw activations across layers, especially at lower quantization bit widths. Specifically, the first few layers tend to have higher sparsity than raw activations (especially after quantization), while the later layers tend to be slightly lower. Note that the low sparsity of the first two CONV layers is caused by data normalization. Removing normalization gives higher sparsity for 16 bits: 56% on average, with 56% and 49.6% for the first two layers. Normalization has less effect under 8- and 5-bit quantization, with 77% and 87% respectively. Second, the temporal delta sparsity increases dramatically with quantization, i.e., quantization makes consecutive imaps more redundant. Third, with quantization, the temporal delta is sparser than the raw activations: for C3D, for example, 74.7% (8 bits) and 86% (5 bits) for the temporal delta versus 56% (8 bits) and 67.3% (5 bits) for raw activations.

Related work [52] leveraged the spatial data locality of adjacent CONV windows to design DNN accelerators. We further analyze the characteristics of the spatial delta, defined as S_delta = In[D][C][H][w+1] - In[D][C][H][w], w = 0, 1, 2, ..., W-1, and show that the temporal delta tends to be sparser than the spatial delta, as shown in Figures 3(e) and 3(f).

(2) Another insight is that the temporal delta needs fewer effectual bits than raw activations. We first define effectual bits. Consider a multiplication a×w of an activation a with a weight w. If a is represented by p bits, the multiplication amounts to adding p terms, where the i-th term is the result of multiplying the i-th bit of the multiplier a with the multiplicand w shifted by i bit positions:

$$a \times w = \sum_{i=0}^{p-1} a_i \cdot (w \ll i)$$

Only the bits of a that are 1 yield effectual work. The effectual bits are defined as the bits of a that are 1. Figures 4(a)-(d) show the cumulative distribution of the number of effectual terms per raw activation, temporal delta, and spatial delta for 16- and 8-bit quantization, respectively, across all neural network layers. The distribution is measured over all input data. These figures show that there is significant potential to reduce the number of computations if the temporal delta is processed using a bit-serial multiplier [46][35], as temporal deltas contain considerably fewer effectual terms per value. Figures 4(e) and 4(f) show the average effectual bits under different quantization levels: the temporal delta contains the fewest effectual bits on average, which indicates potential speedup. For example, under 8-bit quantization, the average effectual bits for raw activations and temporal delta are 0.87 and 0.35 for C3D, and 1.68 and 0.6 for I3D, which indicates 2.49x and 2.8x potential speedup. To be specific, assume that a is an activation of In[d][C][H][W] and a' is the corresponding activation at the same position of In[d+1][C][H][W]. Rather than calculating a'×w directly, we can instead calculate it relative to a×w:

$$a' \times w = (a \times w) + (a' - a) \times w = (a \times w) + (\Delta a \times w)$$

Note that if we replace d with w, we obtain the spatial delta S_delta = In[D][C][H][w+1] - In[D][C][H][w], w = 0, 1, 2, ..., W-1. This is called S-Flow, which can also be applied to 2D CONV.

Figure 5. Reuse ratio across different input frames of UCF-101.
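
The effectual-bit definition and the differential product can be made concrete with the following sketch. It models a bit-serial multiplier that spends one cycle per effectual bit of a non-negative activation; the function names and the one-cycle-per-bit cost model are simplifying assumptions for illustration (in particular, no Booth-style recoding is modeled).

```python
def effectual_bits(a):
    """Number of 1 bits in the magnitude of a."""
    return bin(abs(a)).count("1")


def shift_add_multiply(a, w):
    """Multiply non-negative a by w using one shift-and-add per effectual bit.

    Returns (product, cycles), where cycles counts the effectual bits of a.
    """
    acc, i, cycles = 0, 0, 0
    while a:
        if a & 1:
            acc += w << i
            cycles += 1
        a >>= 1
        i += 1
    return acc, cycles


def differential_product(prev_product, a_prev, a_next, w):
    """Compute a_next*w as a_prev*w + (a_next - a_prev)*w."""
    delta = a_next - a_prev
    sign = -1 if delta < 0 else 1
    term, cycles = shift_add_multiply(abs(delta), w)
    return prev_product + sign * term, cycles


a, a_next, w = 200, 204, 7          # 200 = 0b11001000, 204 = 0b11001100
p, c_raw = shift_add_multiply(a, w)
p_next, c_delta = differential_product(p, a, a_next, w)
print(p, c_raw)          # 1400 3
print(p_next, c_delta)   # 1428 1
```

Computing 204 × 7 directly would need four effectual bits, whereas the delta of 4 needs only one, which is the saving the temporal delta dataflow targets.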

(3) Third, even though on average the temporal delta tends to be sparser and to contain fewer effectual bits than the spatial delta, this varies across inputs. We argue that dynamic control logic should be used to switch between the spatial delta dataflow (S-Flow) and the temporal delta dataflow (T-Flow). First, Figure 5 shows the computation reuse ratio along the spatial and temporal dimensions for different inputs in the UCF-101 [27] dataset. Most inputs benefit more from T-Flow, while others benefit more from S-Flow. Second, in some layers D is very small (ranging from 2 to 4 for the last four layers of C3D). Directly using T-Flow there leaves the PEs along the row (as shown later in our design) underutilized, since the temporal/spatial dimension is unrolled by 8 (P_d or P_w) in our design. Thus, to maximize the benefits, a dynamic control mechanism is needed to switch between S-Flow and T-Flow across different inputs and layers.
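
To make the two selection criteria above tangible, the sketch below shows one plausible software heuristic: fall back to S-Flow when D is smaller than the unroll factor, and otherwise pick the flow whose raw-plus-delta representation contains fewer effectual bits. This rule, the function names, and the toy inputs are assumptions for illustration only; the sketch does not describe the hardware control logic of TSVA.

```python
def popcount(v):
    """Effectual bits in the magnitude of v."""
    return bin(abs(v)).count("1")


def flow_costs(act):
    """Effectual-bit totals for T-Flow and S-Flow on act indexed [d][w]."""
    D, W = len(act), len(act[0])
    # T-Flow: first frame raw, remaining frames as temporal deltas.
    t_cost = sum(popcount(v) for v in act[0])
    t_cost += sum(popcount(act[d + 1][w] - act[d][w])
                  for d in range(D - 1) for w in range(W))
    # S-Flow: first column raw, remaining columns as spatial deltas.
    s_cost = sum(popcount(row[0]) for row in act)
    s_cost += sum(popcount(row[w + 1] - row[w])
                  for row in act for w in range(W - 1))
    return t_cost, s_cost


def choose_flow(act, p_d=8):
    """Hypothetical selection rule between T-Flow and S-Flow."""
    if len(act) < p_d:          # too few frames to fill the PE columns
        return "S-Flow"
    t_cost, s_cost = flow_costs(act)
    return "T-Flow" if t_cost <= s_cost else "S-Flow"


static_scene = [[1, 2, 4, 8] for _ in range(8)]            # no temporal change
flicker = [[v] * 4 for v in [0, 7, 0, 7, 0, 7, 0, 7]]      # spatially uniform
print(choose_flow(static_scene), choose_flow(flicker), choose_flow(static_scene[:2]))
```

A static scene favors T-Flow, a spatially uniform but flickering input favors S-Flow, and a layer with only two frames falls back to S-Flow regardless of its values.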

Prior works leveraged the temporal locality of consecutive frames in continuous mobile vision to design accelerators [53][54]. However, they do not consider the additional effectual-bit savings, and they only consider the temporal locality of the imaps of the first CONV layer. We use a tile-based computation dataflow that processes the temporal delta bit-serially on effectual bits only. In addition, we use a dynamic control mechanism to switch between T-Flow and S-Flow to maximize the benefits. This approach applies not only to 3D CNNs, but also to general 2D CNNs.

3. Architecture Design 

3.1 Baseline Accelerator 

We first describe our baseline, a bit-parallel value-agnostic accelerator (BVA) modeled after DianNao [12] and DaDianNao [55]. We choose it because it is a well-understood design that has inspired many later accelerators. Figure 6 shows a tile of BVA, which has a weight buffer (WB) that provides 16×16 weights per cycle, one per weight lane. The tile also has an input activation memory (NBin) that provides 16 neurons per cycle, one per activation lane, and an output activation buffer (NBout) that can accept 16 output activations per cycle. We use P_c and P_m to denote the number of input activations and weights processed per cycle; here P_c = 16 and P_m = 16 in each tile. In each cycle, 16 activations are broadcast to 16 filter lanes. Only one AM slice (16-activation slice) operates per cycle, and all tiles see the same set of 16 activations. The neuron buffer (NB), which is the last-level on-chip SRAM, is single-ported and banked. Half of the banks are used for the imaps and the other half for the omaps. Both activations and weights are read from and written to off-chip memory. The number of filters (P_m), the number of tiles (Figure 6 shows just one tile), the number of weights per filter (P_c), and the precision B (16 bits in Figure 6) are all configurable design parameters.

BVA features a dataflow that processes P_c input feature maps and P_m output feature maps in parallel. In each clock cycle, it handles P_c imaps and P_m omaps, with one activation of each output feature map and one weight of each kernel. This dataflow is used not only in DianNao [12] and DaDianNao, but also in many state-of-the-art FPGA accelerators [56][57][58]. We name it the MIFM-MOFM dataflow.

For 3D convolution, directly applying the baseline accelerator poses several challenges: (1) Keeping all data, including imaps, omaps, and weights, on chip is infeasible, since 3D CNNs have two more dimensions that easily make the working set exceed the on-chip memory. In fact, this is a problem not only for 3D CNNs but also for 2D CNNs [59]. Loop tiling should be leveraged. (2) DaDianNao [55] is based on the premise that the storage needed for the input activations is far smaller than that for the weights. However, this is not the case for 3D CNNs. For example, the storage requirements of the input activations exceed the weight size for the first four CONV layers. Memory fragmentation overheads will be larger if a uniform on-chip buffer design and partition is applied. (3) Since this dataflow is value-agnostic, it cannot exploit the temporal dimension of 3D CNNs or the reduction in effectual bits offered by the temporal delta. These factors need to be considered in the accelerator design.

Figure 6. Baseline bit-parallel value-agnostic accelerator.

Figure 7. Differential convolution of the temporal delta output activation propagating across the temporal (D) dimension from one column to the next; each column corresponds to one i, i = 0, 1, 2, ..., D-1.

3.2 Delta Value-Aware Accelerator 

In Sections A-E, we first discuss our design using T-Flow; we then extend the design for flexibility.

A. Architectural Factors to be Considered 

Loop Tiling. The on-chip buffer is not large enough to store the feature maps and weights of a 3D CNN, so the efficient approach is to tile the data so that each tile fits in the on-chip buffer. We use T_x to denote the tile size of a parameter X. When the imap does not fit on chip, the input activation of [C, D, H, W] is broken into tiles of [T_c, T_d, T_h, T_w]. When not all M filters fit in on-chip memory, the filters of [M, C, T, R, S] are broken into tiles of [T_m, T_c, T, R, S]. T, R, and S are generally small (ranging between 1 and 7 [23][36]) and are not considered for tiling. Section 2 showed that data reuse along the D dimension is high. To ensure high performance for the temporal delta calculation, we want as much data along the D dimension as possible to fit in the on-chip buffer. For example, suppose the input activation buffer is 108 KB (the total SRAM of Eyeriss [15]). The CONV2 layer of C3D has the largest input activation size: C=64, D=16, H=56, W=56 (activations stored using 1 byte), so we cannot store all the input activations and weights on chip. Choosing tile factors of T_c=64, T_d=16, T_h=10, T_w=10 is more favorable than T_c=64, T_d=4, T_h=20, T_w=20, since we want T_d to be as close to D as possible. When the first tile of In[T_h][T_w] and the next tile overlap along the H-dimension or W-dimension (referred to as the “data halo” [60]), we take advantage of the overlap and do not re-fetch the overlapped region in the H- and W-dimensions.
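
The tiling example above can be checked with simple arithmetic. The sketch below computes the footprint of the two candidate tiles and of the whole CONV2 imap, assuming 1 byte per activation and 1 KB = 1024 bytes; the helper name `tile_bytes` is illustrative.

```python
def tile_bytes(t_c, t_d, t_h, t_w, bytes_per_act=1):
    """On-chip footprint of one input activation tile."""
    return t_c * t_d * t_h * t_w * bytes_per_act


BUFFER_BYTES = 108 * 1024                 # 108 KB input activation buffer

full_imap = tile_bytes(64, 16, 56, 56)    # whole CONV2 imap of C3D
tile_a = tile_bytes(64, 16, 10, 10)       # T_d = D = 16
tile_b = tile_bytes(64, 4, 20, 20)        # T_d = 4

print(full_imap, tile_a, tile_b, BUFFER_BYTES)
# 3211264 102400 102400 110592
```

Both candidate tiles occupy the same 100 KB and fit in the buffer, while the full imap (about 3.1 MB) does not; the first tile is preferred only because it keeps all 16 frames on chip, which maximizes temporal reuse.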

Loop Unrolling/Parallelization. While tiling ensures that data are loaded on chip in pieces with better data reuse, loop unrolling expands some dimensions spatially across PEs to parallelize the multiply-and-adds. We leverage the MIFM-MOFM dataflow, which unrolls along the C-dimension and M-dimension, and extend it by unrolling along the D-dimension, with unroll factors P_c, P_m, and P_d respectively. T_c, T_m, T_d, T_h, and T_w denote the tiling factors that determine the on-chip buffer size; P_c, P_m, and P_d are not necessarily equal to T_c, T_m, and T_d, but they satisfy P_c ≤ T_c, P_m ≤ T_m, and P_d ≤ T_d. For example, as discussed above, T_d should be as close to D as possible to maximize temporal reuse. P_d, however, should not: D can be as large as 64 in I3D, and choosing the largest value would leave no design space for P_c and P_m due to area limitations. As for P_c and P_m, DianNao [12] has P_c = P_m = 16 with an area of 3.02 mm^2, and DaDianNao [13] has P_c = 16 and P_m = 256 with a total area of 67.7 mm^2. These parameters are configurable based on chip area, design goals, and usage scenarios.

B. Temporal Delta Convolution 

After tiling and unrolling, we consider the temporal delta convolution on chip. For a given output activation out(m, d, h, w), it is possible to compute out(m, d+1, h, w) differentially using the following equation:

$$out(m, d+1, h, w) = out(m, d, h, w) + CONV(filter[m], \Delta In)$$

ΔIn refers to the element-wise deltas of the imap windows corresponding to out(m, d+1, h, w) and out(m, d, h, w):

$$\Delta In(c, k, i, j) = In(c, k + stride \cdot (d+1), i + stride \cdot h, j + stride \cdot w) - In(c, k + stride \cdot d, i + stride \cdot h, j + stride \cdot w), \quad c = 0, 1, 2, \ldots, C-1$$

Here stride is the stride between two adjacent imap windows. Figure 7 shows an example of differential convolution, which applies a 3×3 filter to three convolution windows along the temporal dimension (D). All columns of the T-Flow architecture share the same filters, and each column processes one convolution window along the D-dimension. Direct convolution operates on the raw activations of every window; differential convolution, in contrast, uses the raw activations only for the first window and the temporal deltas for the rest. All three convolution windows are computed concurrently. The final outputs are then constructed in a cascaded fashion, as shown in the rightmost part of the figure. First, the output activation of ΔCONV2 is calculated as 59 and that of ΔCONV3 as 50. The output activations of CONV2 and CONV3 are then obtained by addition: 811 + 59 = 870 and 870 + 50 = 920.
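
The cascaded construction can be checked numerically with a small sketch. Each 3×3 window is flattened into a list of nine values and the convolution reduces to a dot product; the filter and window values below are made up for illustration and are not the values shown in Figure 7.

```python
def dot(filt, window):
    """Inner product of a flattened filter with a flattened window."""
    return sum(f * x for f, x in zip(filt, window))


def differential_outputs(filt, windows):
    """Outputs for consecutive windows along D using temporal deltas.

    Phase 1: raw convolution for the first window and delta convolutions
             for the rest (all independent, so they can run concurrently).
    Phase 2: cascaded accumulation, one addition per output.
    """
    raw_first = dot(filt, windows[0])
    delta_terms = [dot(filt, [b - a for a, b in zip(windows[i - 1], windows[i])])
                   for i in range(1, len(windows))]
    outs = [raw_first]
    for term in delta_terms:
        outs.append(outs[-1] + term)
    return outs


filt = [1, 2, 1, 0, 1, 0, 2, 1, 2]              # hypothetical 3x3 filter
windows = [[10, 12, 11, 9, 10, 10, 8, 9, 12],   # window at frame d
           [10, 12, 12, 9, 11, 10, 8, 9, 12],   # window at frame d+1
           [10, 13, 12, 9, 11, 10, 8, 9, 13]]   # window at frame d+2
outs = differential_outputs(filt, windows)
direct = [dot(filt, win) for win in windows]
print(outs, direct)  # [104, 106, 110] [104, 106, 110]
```

The differential path reproduces the direct results while the delta windows contain only two nonzero entries each, so a bit-serial PE has far fewer effectual terms to process.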

C. Basic Processing Engine (PE) 

The basic PE builds upon the Bit-Pragmatic accelerator (PRA) [35], which processes only the effectual bits of input activations bit-serially, as shown in Figure 8. After loading P_c × 8 bits of input activations (P_c = 16, activations stored in 8 bits), the offset generators convert the activations into a stream of effectual powers of two after applying a modified Booth encoding. Each cycle, PRA multiplies a weight by one power of two, or oneffset, using a shifter. To match the throughput of the bit-parallel baseline, we unroll P_d to 8, since the activations are stored in 8 bits. In that way, P_c × P_d activations can be computed concurrently over multiple bit-serial cycles. PRA uses an optimized two-stage shifter scheme, in which each oneffset consists of 4 bits: 2 bits for the one-offset, 1 sign bit, and 1 valid bit. Each bit-serial PE has an N-input adder tree and N shifters (instead of multipliers), each of which shifts a weight as directed by its oneffset. Multiplying activations by filters is implemented by feeding the activations bit-serially as the generated oneffsets and the weights bit-parallel.

D. Scale-out Design 

Figure 9 shows the scale-out design, in which the PE is replicated in two dimensions to improve processing throughput. Since the PE uses a bit-serial multiplier that operates only on the effectual bits, one multiplication takes multiple cycles. Even though the average is only 2-3 cycles, the worst case can still take up to 8 cycles (when an input activation is 11111111) to calculate one product. To maintain performance comparable to the baseline bit-parallel accelerator, we need to process P_d output activations simultaneously along the D-dimension (only the first is a raw output activation; the rest are output activations of the temporal delta). As shown in Figure 9, the vertical parallelism granularity is the number of omaps P_m, and the horizontal parallelism granularity is P_d. The parallelism of P_c, already discussed in Section C, is hidden inside the PE itself. The red arrow shows the cascaded accumulation along the temporal dimension.

E. T-Flow 

The parallelism of the temporal dimension offers a multitude of options for processing neurons in parallel. We opt to process P_d windows in parallel, each using a neuron brick from the same row and column position of its window, so that the accelerator can process P_c × P_m output neurons in parallel. For example, for a layer with stride 1 in the D-dimension, the accelerator can process P_d neuron bricks (see footnote 1) InB(c, d, h, w), InB(c, d+1, h, w), ..., InB(c, d+P_d-1, h, w), where each column processes the same convolutional window position along the D-dimension. The first CONV window is calculated using the raw activations, while the rest are computed using the temporal delta. We conservatively unroll P_d to 8 to compensate for the cycles lost to bit-serial processing.

Figure 9. Scale-out USPE array, one Processing Unit (PU).

In this architecture, we only need to compute the first imap in the D-dimension directly and the remaining outputs along D differentially. To make this possible, we buffer the data along D completely for a tile of height T_h and width T_w. Once all the input activations in the T_h × T_w tile are consumed, the next set of T_h × T_w input activations is loaded on chip. The on-chip SRAM is double-buffered so that the on-chip MAC computation can be pipelined with the loading of the next set of data from off-chip memory.

The accelerator computes each output in the temporal dimension in two phases: (1) first, T-Flow calculates the first imap using raw data, in parallel with computing the rest using the temporal delta; (2) second, the results are propagated in a cascaded fashion with just a single addition per output. While the first phase needs tens to hundreds of cycles, the second phase takes only several cycles using a set of adders. The second phase can be pipelined with the computation of the next set of windows along the temporal dimension. The data path of the second phase for cascaded accumulation is shown as the red arrow in Figure 9.

We store the first column's input activations as raw values
and the rest as deltas, since storing all values along the
D-dimension as raw activations would incur the overhead of
recomputing the deltas every time activations are read out.
Storing deltas also saves off-chip memory traffic. Each
column's NBout holds the output neuron brick of the next layer.
We use a SIMD engine, Deltaout, containing (1) a Pd-to-1
multiplexer, (2) ALUs that apply the activation function and
generate the activation deltas of the next layer, and (3) an
output activation neuron brick buffer. Computing the delta
bricks of the next layer takes two phases: (1) reading out the
output bricks that lie Stride_next columns to the left of the
current NBout, passing them through the NBout, and storing
them into the output brick buffer of the


1 The neuron brick InB(c, d, h, w) equals InB(c, d, h, w),
InB(c+1, d, h, w), ..., InB(c+Pc-1, d, h, w).
Figure 10. Configurable last-level SRAM, on-chip buffer.
Blue blocks are programmable. The bank assign logic outputs
a 2B-bit vector to indicate whether each bank belongs to
input, output or weight.

engine; (2) reading the output brick of the current column's
NBout, passing it through the activation function, using the
SIMD adder to perform element-wise subtraction to generate the
deltas of the next layer, and then storing them into the
last-level on-chip SRAM. The delta values can be stored in
groups with dynamic precision [61] to reduce off-chip energy.
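
The delta generation for the next layer can be sketched as
follows. This is an illustration under stated assumptions rather
than the Deltaout design itself: ReLU stands in for the
activation function (the text does not name one), columns without
a left neighbour at distance stride_next are kept as raw values,
and the function names are hypothetical.

```python
def relu(x):
    # Assumed activation function; the text does not specify one.
    return max(0, x)


def next_layer_deltas(out_bricks, stride_next=1):
    """Turn per-column output bricks into 'first raw, rest delta' form.

    out_bricks[col] is the pre-activation output brick of PE column col.
    """
    activated = [[relu(v) for v in brick] for brick in out_bricks]
    stored = []
    for col, brick in enumerate(activated):
        if col < stride_next:
            # No brick stride_next columns to the left: keep raw values.
            stored.append(list(brick))
        else:
            left = activated[col - stride_next]
            # Element-wise subtraction, as done by the SIMD adder.
            stored.append([a - b for a, b in zip(brick, left)])
    return stored


bricks = [[4, -1], [6, 2], [5, -3]]  # made-up values
assert next_layer_deltas(bricks) == [[4, 0], [2, 2], [-1, -2]]
```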

Supporting S-Flow: note that all the above discussions in
Sections A-E refer to the temporal delta dataflow. To support
the spatial delta dataflow, we only need to replace d with w
in all the above discussions. That is, we process the spatial
delta (instead of the temporal delta) of different input CONV
windows along the W-dimension. The loop unrolling factors are
Pc, Pm, and Pw. Pd is replaced with Pw, and both indicate the
number of PE columns in the scale-out design in Figure 9. We
discuss configurability next.

F. Configurability 

Configurable last-level on-chip SRAM: The last-level on-chip
SRAM is made configurable because the optimal tile size (input
activations, output activations, and weights) varies across
layers. Since the on-chip SRAM capacity is fixed at design
time, for some layers the on-chip buffer should allocate more
space to the activations and less to the weights, while for
other layers it should allocate more to the weights and less
to the activations. Configurability allows this and minimizes
internal fragmentation. The design is only used in the
last-level SRAM. Banks are allocated to each data type
contiguously, and the “Bank Assign” registers are configured
at the start of each layer to denote the range of banks used
to store each data type. The parallel demultiplexer is used to
index into each group of banks and to read or write data. The
output (read) mux is replicated for each of the three data
types; since each type reads one word per access, there are no
bank conflicts. Programmable FSMs (sequencers) generate the
address patterns into each group of banks. High-order address
bits, along with the bank assignment registers, determine
which bank is responsible for each type. We configure the SRAM
with 16 banks, and banking shows very little overhead [62].
The traffic between the last-level SRAM and the tiles of PUs
(as shown in Figure 9) is carried by a NoC, which manages the
data delivery between the buffer and the PE arrays and is
implemented as a broadcast-style network.
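
As an illustration of the bank assignment, the sketch below
builds the 2B-bit vector of Figure 10 for B = 16 banks. The
2-bit type codes, the bit layout, and the function names are
assumptions made for this example; the text only states that
banks are allocated contiguously per data type and that the
vector has two bits per bank.

```python
INPUT, OUTPUT, WEIGHT = 0b00, 0b01, 0b10  # hypothetical 2-bit type codes


def bank_assign_vector(num_input, num_output, num_weight):
    """Return (vector, B); bits [2b+1:2b] of vector hold bank b's type."""
    # Contiguous allocation: input banks first, then output, then weight.
    types = [INPUT] * num_input + [OUTPUT] * num_output + [WEIGHT] * num_weight
    vector = 0
    for bank, code in enumerate(types):
        vector |= code << (2 * bank)
    return vector, len(types)


def bank_type(vector, bank):
    """Decode the data type stored in a given bank."""
    return (vector >> (2 * bank)) & 0b11


# An activation-heavy layer: 8 input, 4 output, 4 weight banks.
vec, num_banks = bank_assign_vector(8, 4, 4)
assert num_banks == 16
assert bank_type(vec, 0) == INPUT
assert bank_type(vec, 8) == OUTPUT
assert bank_type(vec, 12) == WEIGHT
```

A weight-heavy layer would simply reprogram the registers at
layer start, for example with bank_assign_vector(4, 2, 10).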

Dynamic switch between S-Flow and T-Flow: T-Flow and S-Flow
[52] differ in two ways. (1) The former is unrolled into Pc,
Pm, and Pd (the concurrent number of frames unrolled in the
temporal dimension), while the latter is unrolled into Pc, Pm,
and Pw (the concurrent number of adjacent CONV windows
unrolled in the spatial dimension W). (2) The former
accommodates enough imaps across the D-dimension by making Td
as large as possible, while the latter accommodates enough
imaps by making Tw as large as possible.

The dimension along which data is loaded on-chip first is
decided by the control logic of the on-chip buffer, which
follows the loop order [d, h, w, m, c]. Thus, the control
logic of the on-chip buffer decides which dimension is loaded
completely on-chip first.

Loop order determines the sequential data loading order of
the CONV loops. There are two kinds of loop order: intra-tiling
and inter-tiling. The inter-tiling loop order determines the
data movement from off-chip memory to the on-chip buffer. The
intra-tiling loop order determines the pattern of data
movement from the on-chip buffer to the PEs. We consider the
inter-tiling loop order, which is realized by the dynamic FSM
(sequencer) in Figure 10. To dynamically switch between
S-Flow and T-Flow, we leverage a dynamic FSM to generate
addresses into the on-chip buffer.

The sequencer is used to generate addresses into the on-chip
buffer, count how many MACs to perform, decide when to read or
write psums relative to performing MACs, and detect when the
processing of all tiles is done. When the loop order of
loading data on-chip changes, the loop bounds and the memory
accesses to each tile change. To ensure control flexibility,
the FSM is programmed by setting two sets of configurable
registers that denote the loop bounds and loop steps, and it
uses loop counters to iterate in the specified loop order. For
example, the loop bound for [D, H, W, M, C] is [0->D, 0->H,
0->W, 0->M, 0->C], while the loop step is [Td, Th, Tw, Tm,
Tc]. The FSM walks through the loops using the loop bounds and
accumulates a step into an output register. For the Z-level
(Z=5) loop, the user specifies bounds and steps, and the
iteration indexes are i_0, ..., i_{Z-1}. The data loading order
behaves like software loop iterations. When entering each
state, the FSM outputs the current value in the output
register, and one of the steps S_j is added to that register
for the next iteration (similar to [32][63]), where j
identifies the loop that is currently terminating. Different
loop orders and steps give the different address sequences
needed by the on-chip buffer.
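
A software analogue of the programmable sequencer is sketched
below. It is only meant to show how loop bounds, loop steps,
and loop order together define the tile loading sequence; it
does not model the output register or the hardware FSM, and
the function name and the layer dimensions are hypothetical.

```python
from itertools import product


def tile_origins(bounds, steps, order):
    """Yield the origin of every tile; the first name in order is outermost.

    bounds and steps map a dimension name to its loop bound and tile step.
    """
    ranges = [range(0, bounds[dim], steps[dim]) for dim in order]
    for idx in product(*ranges):
        yield dict(zip(order, idx))


# Made-up layer: D=8, H=4, W=4, M=2, C=2.
bounds = {"D": 8, "H": 4, "W": 4, "M": 2, "C": 2}
# T-Flow style tiling: the whole D-dimension stays on-chip (Td = D).
steps = {"D": 8, "H": 2, "W": 2, "M": 2, "C": 2}
order = ["H", "W", "D", "M", "C"]

tiles = list(tile_origins(bounds, steps, order))
assert len(tiles) == 4
assert tiles[0] == {"H": 0, "W": 0, "D": 0, "M": 0, "C": 0}
assert (tiles[1]["H"], tiles[1]["W"]) == (0, 2)
```

Changing order or steps, which corresponds to reprogramming the
two register sets, yields a different loading sequence without
changing the code.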

There are two sets of loop orders and steps to choose from
for the registers, one for S-Flow and one for T-Flow. We
leverage the offline software optimization framework [64],
which pre-analyzes 3D CNNs and finds the optimal tiling and
inter-tiling loop order for S-Flow and T-Flow, subject to the
restriction W=Tw for S-Flow and D=Td for T-Flow. We use two
signals, “temporal_signal” and “temporal_flow”, to choose
whether to use T-Flow or S-Flow. Before entering the CONV
layer, the input is first profiled to compare the reuse ratio
of the spatial delta and the temporal delta: if the temporal
delta is sparser, temporal_signal is set on; otherwise it is
set off. Then, for each layer, if temporal_signal is on and D
is at least the number of PE columns, temporal_flow is set on
and the optimal T-Flow loop order and tiling are programmed
into the on-chip buffer's sequencer. Otherwise, temporal_flow
is set off, indicating S-Flow, and the optimal S-Flow loop
order and tiling are programmed into the sequencer. With this
control flow, we ensure the optimal off-chip loading energy
efficiency.
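
The per-layer decision reduces to a short predicate. The sketch
below restates the rule in code; the function name and the
boolean argument are hypothetical, and the default of 8 PE
columns follows the Pd = 8 configuration used in this paper.

```python
def select_flow(temporal_delta_sparser, depth, pe_columns=8):
    """Return the dataflow for one layer according to the two signals."""
    # temporal_signal: result of profiling the input deltas.
    temporal_signal = temporal_delta_sparser
    # temporal_flow: T-Flow also needs enough frames to fill the PE columns.
    temporal_flow = temporal_signal and depth >= pe_columns
    return "T-Flow" if temporal_flow else "S-Flow"


assert select_flow(True, 16) == "T-Flow"
assert select_flow(True, 4) == "S-Flow"    # D smaller than the PE columns
assert select_flow(False, 16) == "S-Flow"  # spatial delta is sparser
```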

The temporal_flow signal also controls the unrolling factors.
The computation order for S-Flow is Pw->Pm->Pc, and for
T-Flow it is Pd->Pm->Pc. To achieve that, we also change the
data address generation pattern from the on-chip buffer to
the PEs. For T-Flow, different columns of PEs in the scale-out
design are loaded with the inputs corresponding to different
CONV windows in the D-dimension; for S-Flow, different
columns of PEs are loaded with the inputs corresponding to
different CONV windows in the W-dimension. Besides, 2D CONV
is directly supported using S-Flow by setting temporal_flow
to off.

Note that under our control flow, only two cases can occur
across the execution of all layers: (1) the earlier layers use
T-Flow and the later layers use S-Flow. In this case, at the
layer after which execution switches from T-Flow to S-Flow,
the spatial deltas for the next layer need to be computed
separately. (2) All layers use S-Flow. In this case, all
inputs are stored as spatial deltas.

4. Benefits to Other CNNs 

Our TSVA dataflow benefits not only 3D CNNs but also 2D CNN
inference. First, for workloads running under a continuous
mobile vision scenario, pixel changes across consecutive
frames are not arbitrary. Also, in most cases, several
consecutive frames contain the same objects to be detected.
Thus, if the users care more about throughput than latency, it
is possible to fuse the imaps of multiple frames (with D
larger than 8) and run them concurrently through the neural
network using T-Flow. In that case, lots of redundant
computations can be saved. According to statistics published
by Google, at least 29% of Google's data center workloads are
sequence processing. Besides, S-Flow can be directly applied
to 2D CNNs.

5. Evaluation 

This section evaluates the performance and energy consumption
of TSVA. First, we present our evaluation methodology. Second,
we present the speedup achieved by TSVA. Then, we present the
area, power breakdown, energy efficiency, and off-chip memory
energy savings.

A. Evaluation Methodology 

We have developed a cycle-accurate simulator to model the
performance of all architectures. We model an accelerator of
four tiles. The designs are implemented in Verilog and
synthesized with the Synopsys Design Compiler [65] using the
TSMC 65nm library. We use CACTI [62] to model the area and
power consumption of the on-chip SRAM memories. The
accelerator frequency is set to 1GHz. We model a DRAM with
4GB LPDDR4-2400 channels; the DRAM energy is estimated using
information in the Micron technote [66]. We


Table I. Parameters for the accelerator

Parameter         Value
# of Tiles        4
SRAM Banks        64KB x 16
Technology        65 nm
Off-chip Memory   4GB LPDDR4-2400
Filters/Tile      16
Weights/Filter    16
Frequency         1GHz
NBin/Tile         2KB
NBout/Tile        2KB


compare our design with state-of-the-art schemes, including
the BVA tile design of DianNao [12] and DaDianNao [13] and the
bit-serial accelerator Bit-Pragmatic [35]. Since our
accelerator only targets inference, and prior work has claimed
that 8 bits are sufficient for inference [50], we store the
weights and activations using 8 bits for both the baseline and
our design. We further compare the speedup when the
activations are quantized to 5 bits.

The hardware unrolling parameters for the baseline BVA design
are 4 PU tiles, each with Pc=16 and Pm=16. To match the
performance, for the bit-serial design we choose 4 PU tiles,
each with Pc=16, Pm=16, and Pd=8. There are 64KB x 16 banks in
total, which work as the last-level on-chip SRAM to store
inputs, outputs, and weights. The data dispatcher, offset
generator, engine for calculating and cascading delta values,
and control logic also have some overhead that needs to be
considered. The default configurations are reported in Table I.
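
The reason Pd=8 matches the baseline can be checked with simple
arithmetic. The sketch below assumes the usual bit-serial model
in which one lane consumes one activation bit per cycle, so an
8-bit activation takes at most 8 cycles; this model and the
variable names are assumptions made for illustration.

```python
TILES, PC, PM, PD = 4, 16, 16, 8
ACT_BITS = 8  # activation precision used for both designs

# Bit-parallel baseline: one MAC per multiplier per cycle.
bva_macs_per_cycle = TILES * PC * PM

# Bit-serial design: Pd times more lanes, each up to ACT_BITS cycles per MAC.
serial_lanes = TILES * PC * PM * PD
worst_case_serial = serial_lanes / ACT_BITS

# With all 8 bits processed, throughput equals the baseline; skipping
# ineffectual bits is what yields a speedup beyond this point.
assert bva_macs_per_cycle == 1024
assert worst_case_serial == bva_macs_per_cycle
```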

B. Evaluated Networks 

We evaluated the following networks based on their popularity
and state-of-the-art results: (1) C3D [23], one of the most
widely used 3D CNNs for video recognition, from Facebook. To
evaluate C3D, we use videos from the UCF101 [67] benchmark
suite. (2) I3D [36], since it currently holds state-of-the-art
results on the Kinetics [36] video dataset. (3) 3D ResNet-50
[37], a 3D version of ResNet-50. We also use the UCF101
dataset for 3D ResNet-50. (4) YOLO [68] and SSD [69], two of
the most widely used object detection neural networks. The
public object detection benchmarks (Pascal VOC or COCO
datasets) cannot be used since they mainly contain standalone
images. Instead, we capture a series of videos and extract
image sequences. Each image is manually annotated with
bounding boxes and labels. We fuse 8 frames as a batch for
every layer's execution (i.e. D=8, 7M).

C. Speedup 

We first compare the speedup of TSVA with the baselines. Note
that the baseline BVA also uses 8-bit multipliers. The
bit-serial baseline always unrolls and processes different
neuron bricks along the W-dimension (Pw). The speedup is shown
in Figure 11, which mainly demonstrates the speedup caused by
the reduced number of effectual bits processed. On average,
the speedup is 1.35x over the bit-serial baseline and 4.24x
over BVA. The speedup comes from the dynamic control between
spatial delta and temporal delta, which maximizes the
effectual bits saved by the deltas in computation. To further
show the benefits of TSVA, we quantize the input activations
to 5 bits to consider


Figure 12. Speedup with 5-bit quantization enabled. 

the speedup with a little loss of NN accuracy. As shown in
Figure 12, quantization brings an additional speedup of 15.3%
using the same architecture. The additional speedup is caused
by the further reduction in effectual bits due to quantization.
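
To make the notion of effectual bits concrete, the sketch below
counts the non-zero bits that a bit-serial lane would have to
process for raw activations, for their temporal deltas, and for
activations quantized to 5 bits. It assumes a sign-magnitude
view of the deltas and quantization by dropping low-order bits;
both are simplifications chosen for illustration, and the sample
values are made up.

```python
def effectual_bits(value):
    """Number of non-zero bits in the magnitude of value."""
    return bin(abs(value)).count("1")


def total_effectual_bits(values):
    return sum(effectual_bits(v) for v in values)


def quantize(value, bits, full_bits=8):
    """Keep the top `bits` bits of an unsigned full_bits-wide activation."""
    shift = full_bits - bits
    return (value >> shift) << shift


prev_frame = [120, 64, 37, 200]
cur_frame = [122, 64, 36, 201]
deltas = [c - p for c, p in zip(cur_frame, prev_frame)]

assert total_effectual_bits(cur_frame) == 12   # raw activations
assert total_effectual_bits(deltas) == 3       # temporal deltas
assert total_effectual_bits([quantize(v, 5) for v in cur_frame]) == 9
```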

E. Area and Energy Efficiency 

We also evaluate the area of TSVA. The results are shown in
Figure 13. Since we use an 8-bit parallel multiplier instead
of the 16-bit multiplier in the baseline, the PE area
decreases greatly and takes a much smaller share of the area
than the on-chip SRAM subsystem. The control logic added for
switching between S-Flow and T-Flow is not significant
compared to the computation PEs and the on-chip SRAM buffer.
Overall, the area is 1.64x that of the BVA accelerator, with
the increase coming mainly from the computation logic. TSVA
occupies 22.3 mm2 in total, while BVA occupies 13.58 mm2.

Figure 14 reports a breakdown of the power for TSVA and BVA.
While TSVA consumes more power than BVA (around 2.98x), the
speedup is higher than the increase in power consumption,
which makes TSVA 1.42x more energy efficient overall than the
baseline. Note that this covers only the on-chip computation
energy; the energy of off-chip memory transfers is orders of
magnitude higher than that of on-chip memory accesses.
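
The reported ratios can be cross-checked from the numbers
above. The short sketch below assumes that energy efficiency is
measured as work per unit energy, so that it equals the speedup
divided by the power ratio; the variable names are
illustrative.

```python
speedup = 4.24       # TSVA over BVA, from the speedup results
power_ratio = 2.98   # TSVA power / BVA power
area_tsva, area_bva = 22.3, 13.58  # mm^2

# Energy = power x time, so the efficiency gain is speedup / power ratio.
energy_efficiency = speedup / power_ratio
area_ratio = area_tsva / area_bva

assert round(energy_efficiency, 2) == 1.42
assert round(area_ratio, 2) == 1.64
```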

While storing the deltas with dynamic precision already saves
much of the off-chip DRAM access energy, our configurable
buffer, which determines the optimal loop order, saves
another 38%, 33%, and 27% of off-chip DRAM access energy for
the three 3D CNNs, respectively.

6. Related Work 

Neural network accelerators. The huge computational
requirements and wide applicability of deep neural networks
have prompted researchers to design numerous accelerators,
either on FPGA [70][57][56][58][71], on ASIC
[12][13][14][72][15][50][60], or leveraging efficient memory
technologies [73][74]. Representative works such as DianNao
[12] and DaDianNao [13] utilize a large global buffer as
shared storage to minimize the DRAM access energy
consumption. Eyeriss [15] proposes a row-stationary dataflow
by exploiting local data




Figure 13. Area breakdown [mm2], TSVA vs BVA. CL: control
logic, DO: Deltaout, the SIMD engine for delta calculation of
the next layer, OG: Offset generator, DIS: Dispatcher.

Figure 14. Power breakdown, TSVA vs BVA. CL: control logic,
DO: Deltaout, OG: Offset generator, DIS: Dispatcher.

reuse for both filters and input activation maps. Several
recent works leverage bit-serial multipliers [46][35][19] and
bit-serial caches [75].

3D CNN accelerators. Prior work on 3D CNN acceleration
targets either FPGA [31][76] or ASIC [32]. Hegde et al. [32]
designed a flexible 3D CNN accelerator that features flexible
buffering and loop control. We believe that bit-serial
processing can bring further benefits.

Accelerators that exploit spatial and temporal locality.
Several research works leverage spatial [52] and temporal
locality [53][54][45] to accelerate the execution of CNNs.
Diffy takes advantage of the spatial correlation between
adjacent convolution windows, introduces the “Differential
Convolution” architecture, and stores the spatial delta values
both off- and on-chip, reducing the amount of storage and
communication needed. Our design leverages the benefits of
both the spatial and the temporal delta. Zhu et al. [53]
leverage the motion information of continuous frames, using
motion estimation to improve the execution of DNNs. The same
idea was proposed in [54] at the same time, implemented as an
embedded vision accelerator that acts as a co-processor for
DNNs. Riera et al. [45] propose an input reuse design that
leverages the temporal locality within layers. Our design is
different since we only need to cache the input and output
activations of one layer, instead of all layers.


7. Conclusion 

In this paper, we propose TSVA, an accelerator for 3D CNNs
and other NNs under a continuous frame batch processing
scenario. The evaluation shows that TSVA achieves a 4.24x
speedup and 1.42x energy efficiency. As video-based inference
becomes more and more important in people's daily lives, we
believe that our work will bring more benefits in the future.